Given the results of the analysis in fluidigm.Rmd, here I explore the role of the V intercept in the model.
One hypothesis is that this intercept acts as a size factor. Is it true? And if so, Will the data be explained completely by only the intercept and one factor W?
There are four possible models, if we keep X fixed: V present in both \(\mu\) and \(\pi\), only in \(\mu\), only in \(\pi\) and in neither.
\(\gamma_pi\) captures the detection rate, while \(\gamma_mu\) captures the coverage (sequencing depth). Since the second component of W is also highly correlated with coverage, the model can have either \(\gamma_mu\) or two components of W. We need to figure out if this is a general feature or if it’s this particular dataset that exhibit confounding between biology and coverage.
Since we didn’t see much difference between the high and low coverage data, I will focus on the latter. I will not include any normalization / offset.
Select low coverage cells, filter out the genes that do not have at least 10 counts in at least 10 cells.
data("fluidigm")
fluidigm_low <- fluidigm[,which(colData(fluidigm)$Coverage_Type=="Low")]
filter <- apply(assay(fluidigm_low)>10, 1, sum)>=10
Number of retained genes:
print(sum(filter))
## [1] 2506
fluidigm_low <- fluidigm_low[filter,]
low <- assay(fluidigm_low)
First, let’s look at PCA (of the log counts) for reference.
bio <- as.factor(colData(fluidigm_low)$Biological_Condition)
cl <- as.factor(colData(fluidigm_low)$Cluster2)
detection_rate <- colSums(low>0)
coverage <- colSums(low)
pca <- prcomp(t(log1p(low)))
plot(pca$x, col=cols[bio], pch=19, main="PCA of log-counts, centered not scaled")
legend("topleft", levels(bio), fill=cols)
print(system.time(zinb_Vall <- zinbFit(low, ncores = 3, K = 2)))
## Warning in simpute.als(x, J, thresh, lambda, maxit, trace.it, warm.start, :
## Convergence not achieved by 100 iterations
## user system elapsed
## 409.710 66.880 182.534
Plot the results with cells colored according to their biological condition.
plot(zinb_Vall@W, col=cols[bio], pch=19, xlab="W1", ylab="W2", main="Both V intercepts")
legend("topleft", levels(bio), fill=cols)
One interpretation of the model is that \(\gamma_mu\) will act as a “size factor”. Here, we check whether it captures sequencing depth and/or detection rate.
#total number of detected genes in the cell
df <- data.frame(gamma_mu = zinb_Vall@gamma_mu[1,], gamma_pi = zinb_Vall@gamma_pi[1,], detection_rate=detection_rate, coverage=coverage)
pairs(df, pch=19, col=cols[bio])
print(cor(df, method="spearman"))
## gamma_mu gamma_pi detection_rate coverage
## gamma_mu 1.0000000 0.8181818 -0.7086552 0.9677885
## gamma_pi 0.8181818 1.0000000 -0.9484577 0.7143357
## detection_rate -0.7086552 -0.9484577 1.0000000 -0.6348329
## coverage 0.9677885 0.7143357 -0.6348329 1.0000000
\(\gamma_pi\) clearly captures the detection rate and \(\gamma_mu\) clearly captures the coverage. Interestingly, the two estimates are positively correlated.
par(mfrow=c(1, 2))
plot(rowMeans(getPi(zinb_Vall)), detection_rate, xlab="Average estimated Pi", ylab="detection rate", pch=19, col=cols[bio])
plot(rowMeans(log1p(getMu(zinb_Vall))), coverage, xlab="Average estimated log Mu", ylab="coverage", pch=19, col=cols[bio])
It is worth looking at the other estimates as well, namely, \(\beta\) and \(\alpha\).
par(mfrow=c(2, 2))
boxplot(zinb_Vall@beta_mu[1,], main="Beta_Mu")
boxplot(zinb_Vall@beta_pi[1,], main="Beta_Pi")
plot(t(zinb_Vall@alpha_mu), xlab="Alpha_Mu 1", ylab="Alpha_Mu 2", main="Alpha_Mu")
plot(t(zinb_Vall@alpha_pi), xlab="Alpha_Pi 1", ylab="Alpha_Pi 2", main="Alpha_Pi")
It seems that only a few genes drive the estimate of W.
One possibility is that, once we remove coverage, the data can be explained by only one factor. If that’s the case, plotting \(\gamma_mu\) vs. \(W_2\) should recover the PCA plot.
plot(zinb_Vall@gamma_mu[1,], zinb_Vall@W[,2], col=cols[bio], pch=19, xlab="Gamma_Mu", ylab="W2", main="Both V intercepts")
legend("bottomright", levels(bio), fill=cols)
plot(zinb_Vall@gamma_pi[1,], zinb_Vall@W[,2], col=cols[bio], pch=19, xlab="Gamma_Pi", ylab="W2", main="Both V intercepts")
legend("bottomright", levels(bio), fill=cols)
print(system.time(zinb_Vnone <- zinbFit(low, ncores = 3, K = 2, V=matrix(0, ncol=1, nrow=NROW(low)))))
## user system elapsed
## 156.161 27.808 69.145
plot(zinb_Vnone@W, col=cols[bio], pch=19, xlab="W1", ylab="W2")
legend("bottomright", levels(bio), fill=cols)
df <- data.frame(W1 = zinb_Vnone@W[,1], W2 = zinb_Vnone@W[,2], detection_rate=detection_rate, coverage=coverage)
pairs(df, pch=19, col=cols[bio])
print(cor(df, method="spearman"))
## W1 W2 detection_rate coverage
## W1 1.00000000 -0.02478147 0.9038976 -0.6357955
## W2 -0.02478147 1.00000000 0.1709846 -0.6683566
## detection_rate 0.90389764 0.17098463 1.0000000 -0.6348329
## coverage -0.63579545 -0.66835664 -0.6348329 1.0000000
par(mfrow=c(1, 2))
plot(rowMeans(getPi(zinb_Vnone)), detection_rate, xlab="Average estimated Pi", ylab="detection rate", pch=19, col=cols[bio])
plot(rowMeans(log1p(getMu(zinb_Vnone))), coverage, xlab="Average estimated log Mu", ylab="coverage", pch=19, col=cols[bio])
It is worth looking at the other estimates as well, namely, \(\beta\) and \(\alpha\).
par(mfrow=c(2, 2))
boxplot(zinb_Vnone@beta_mu[1,], main="Beta_Mu")
boxplot(zinb_Vnone@beta_pi[1,], main="Beta_Pi")
plot(t(zinb_Vnone@alpha_mu), xlab="Alpha_Mu 1", ylab="Alpha_Mu 2", main="Alpha_Mu")
plot(t(zinb_Vnone@alpha_pi), xlab="Alpha_Pi 1", ylab="Alpha_Pi 2", main="Alpha_Pi")
V <- cbind(rep(0, NROW(low)), rep(1, NROW(low)))
print(system.time(zinb_Vmu <- zinbFit(low, ncores = 3, K = 2, V=V, which_V_mu=2L, which_V_pi=1L)))
## Warning in simpute.als(x, J, thresh, lambda, maxit, trace.it, warm.start, :
## Convergence not achieved by 100 iterations
## Warning in optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu, logitPi =
## logitPi, : NA/Inf replaced by maximum positive value
## user system elapsed
## 464.293 70.466 203.342
plot(zinb_Vmu@W, col=cols[bio], pch=19, xlab="W1", ylab="W2")
legend("topleft", levels(bio), fill=cols)
df <- data.frame(W1 = zinb_Vmu@W[,1], W2 = zinb_Vmu@W[,2], gamma_mu = zinb_Vmu@gamma_mu[1,], detection_rate=detection_rate, coverage=coverage)
pairs(df, pch=19, col=cols[bio])
print(cor(df, method="spearman"))
## W1 W2 gamma_mu detection_rate coverage
## W1 1.0000000 -0.5879808 -0.5746066 0.8053149 -0.5478584
## W2 -0.5879808 1.0000000 0.3079108 -0.6664554 0.2897727
## gamma_mu -0.5746066 0.3079108 1.0000000 -0.6754155 0.9843531
## detection_rate 0.8053149 -0.6664554 -0.6754155 1.0000000 -0.6348329
## coverage -0.5478584 0.2897727 0.9843531 -0.6348329 1.0000000
par(mfrow=c(1, 2))
plot(rowMeans(getPi(zinb_Vmu)), detection_rate, xlab="Average estimated Pi", ylab="detection rate", pch=19, col=cols[bio])
plot(rowMeans(log1p(getMu(zinb_Vmu))), coverage, xlab="Average estimated log Mu", ylab="coverage", pch=19, col=cols[bio])
It is worth looking at the other estimates as well, namely, \(\beta\) and \(\alpha\).
par(mfrow=c(2, 2))
boxplot(zinb_Vmu@beta_mu[1,], main="Beta_Mu")
boxplot(zinb_Vmu@beta_pi[1,], main="Beta_Pi")
plot(t(zinb_Vmu@alpha_mu), xlab="Alpha_Mu 1", ylab="Alpha_Mu 2", main="Alpha_Mu")
plot(t(zinb_Vmu@alpha_pi), xlab="Alpha_Pi 1", ylab="Alpha_Pi 2", main="Alpha_Pi")
print(system.time(zinb_Vpi <- zinbFit(low, ncores = 3, K = 2, V=V, which_V_mu=1L, which_V_pi=2L)))
## user system elapsed
## 129.919 21.942 58.287
plot(zinb_Vpi@W, col=cols[bio], pch=19, xlab="W1", ylab="W2")
legend("bottomright", levels(bio), fill=cols)
df <- data.frame(W1 = zinb_Vpi@W[,1], W2 = zinb_Vpi@W[,2], gamma_pi = zinb_Vpi@gamma_pi[1,], detection_rate=detection_rate, coverage=coverage)
pairs(df, pch=19, col=cols[bio])
print(cor(df, method="spearman"))
## W1 W2 gamma_pi detection_rate
## W1 1.00000000 0.01088287 -0.7012675 0.7671580
## W2 0.01088287 1.00000000 -0.3999563 0.3585782
## gamma_pi -0.70126748 -0.39995629 1.0000000 -0.9832927
## detection_rate 0.76715801 0.35857819 -0.9832927 1.0000000
## coverage -0.51459790 -0.80729895 0.6306818 -0.6348329
## coverage
## W1 -0.5145979
## W2 -0.8072990
## gamma_pi 0.6306818
## detection_rate -0.6348329
## coverage 1.0000000
par(mfrow=c(1, 2))
plot(rowMeans(getPi(zinb_Vpi)), detection_rate, xlab="Average estimated Pi", ylab="detection rate", pch=19, col=cols[bio])
plot(rowMeans(log1p(getMu(zinb_Vpi))), coverage, xlab="Average estimated log Mu", ylab="coverage", pch=19, col=cols[bio])
It is worth looking at the other estimates as well, namely, \(\beta\) and \(\alpha\).
par(mfrow=c(2, 2))
boxplot(zinb_Vpi@beta_mu[1,], main="Beta_Mu")
boxplot(zinb_Vpi@beta_pi[1,], main="Beta_Pi")
plot(t(zinb_Vpi@alpha_mu), xlab="Alpha_Mu 1", ylab="Alpha_Mu 2", main="Alpha_Mu")
plot(t(zinb_Vpi@alpha_pi), xlab="Alpha_Pi 1", ylab="Alpha_Pi 2", main="Alpha_Pi")
Here, I try to fit a model without W, to see if detection rate and coverage are enough to completely explain the data.
print(system.time(zinb_noW <- zinbFit(low, ncores = 3)))
## user system elapsed
## 74.157 19.171 38.160
plot(zinb_noW@gamma_mu, zinb_noW@gamma_pi, col=cols[bio], pch=19, xlab="Gamma_Mu", ylab="Gamma_Pi")
legend("bottomright", levels(bio), fill=cols)
print(cor(zinb_noW@gamma_mu[1,], zinb_noW@gamma_pi[1,], method="spearman"))
## [1] 0.8332605